LightGrad: Lightweight Diffusion Probabilistic Model for Text-to-Speech
Recent advances in neural text-to-speech (TTS) models bring thousands of TTS applications into daily life, where models are deployed in the cloud to provide services for customers. Among these models are diffusion probabilistic models (DPMs), which can be trained stably and are more parameter-efficient than other generative models. As transmitting data between customers and the cloud introduces high latency and the risk of exposing private data, deploying TTS models on edge devices is preferred. When deploying DPMs on edge devices, two practical problems arise. First, current DPMs are not
lightweight enough for resource-constrained devices. Second, DPMs require many
denoising steps in inference, which increases latency. In this work, we present
LightGrad, a lightweight DPM for TTS. LightGrad is equipped with a lightweight
U-Net diffusion decoder and a training-free fast sampling technique, reducing
both model parameters and inference latency. Streaming inference is also
implemented in LightGrad to reduce latency further. Compared with Grad-TTS,
LightGrad achieves a 62.2% reduction in parameters and a 65.7% reduction in latency while preserving comparable speech quality on both Chinese Mandarin and English in 4 denoising steps.
Comment: Accepted by ICASSP 202
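As a rough illustration of the few-step, training-free sampling idea above, the Python sketch below runs a generic DDIM-style deterministic sampler over a coarse 4-step grid. The denoiser eps_net, the linear beta schedule, and the step grid are assumptions for illustration only, not LightGrad's actual DPM-Solver-based sampler or its lightweight U-Net decoder:

    # Minimal sketch: few-step, training-free sampling for a diffusion TTS decoder.
    # Assumes eps_net(x, mu, t) predicts the noise added at discrete step t;
    # schedule and step count are illustrative, not LightGrad's actual choices.
    import torch

    @torch.no_grad()
    def fast_sample(eps_net, mu, n_steps=4, T=1000):
        """Generate a mel-spectrogram conditioned on the text-encoder output mu."""
        betas = torch.linspace(1e-4, 0.02, T)
        alpha_bars = torch.cumprod(1.0 - betas, dim=0)
        # Coarse grid: only n_steps of the T training steps are visited at inference.
        steps = torch.linspace(T - 1, 0, n_steps + 1).long()
        x = torch.randn_like(mu)                                  # start from pure noise
        for t, t_next in zip(steps[:-1], steps[1:]):
            ab, ab_next = alpha_bars[t], alpha_bars[t_next]
            eps = eps_net(x, mu, torch.full((x.shape[0],), int(t)))
            x0 = (x - (1.0 - ab).sqrt() * eps) / ab.sqrt()        # predicted clean mel
            x = ab_next.sqrt() * x0 + (1.0 - ab_next).sqrt() * eps  # deterministic DDIM step
        return x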
ZeroPrompt: Streaming Acoustic Encoders are Zero-Shot Masked LMs
In this paper, we present ZeroPrompt (Figure 1-(a)) and the corresponding Prompt-and-Refine strategy (Figure 3), two simple but effective training-free methods to decrease the Token Display Time (TDT) of streaming ASR models without any accuracy loss. The core idea of ZeroPrompt is to append zeroed content to each chunk during inference, which acts like a prompt that encourages the model to predict future tokens before they are spoken. We argue that streaming acoustic encoders naturally have the modeling ability of Masked Language Models, and our experiments demonstrate that ZeroPrompt is cheap to engineer and can be applied to streaming acoustic encoders on any dataset without any accuracy loss. Specifically, compared with our baseline models, we achieve a 350-700ms reduction in First Token Display Time (TDT-F) and a 100-400ms reduction in Last Token Display Time (TDT-L), with theoretically and experimentally equal WER on both the Aishell-1 and Librispeech datasets.
Comment: accepted by Interspeech 202
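The Python sketch below illustrates the zero-padding idea under stated assumptions: zeroed frames are appended to each incoming feature chunk so the streaming encoder can speculate on not-yet-spoken tokens, and a second pass on the real frames refines the hypothesis. The encoder and decode_tokens interfaces and the 16-frame prompt length are hypothetical, not the exact ZeroPrompt or Prompt-and-Refine implementation:

    # Minimal sketch: append a zeroed "prompt" to each streaming chunk.
    # encoder(x, cache) -> (states, cache) and decode_tokens(states) -> tokens
    # are assumed interfaces for illustration only.
    import torch

    @torch.no_grad()
    def zero_prompt_step(encoder, decode_tokens, chunk, history, prompt_frames=16):
        """Process one streaming chunk (batch, frames, feat_dim) with zeroed look-ahead."""
        zeros = chunk.new_zeros(chunk.shape[0], prompt_frames, chunk.shape[2])
        prompted = torch.cat([chunk, zeros], dim=1)       # real frames + zeroed prompt
        enc_prompted, _ = encoder(prompted, history)      # speculative pass; cache discarded
        early_tokens = decode_tokens(enc_prompted)        # displayed to the user immediately
        enc_real, history = encoder(chunk, history)       # committed pass on real frames only
        final_tokens = decode_tokens(enc_real)            # refined as later chunks arrive
        return early_tokens, final_tokens, history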
Evolution of kinematic transformation from the Altyn Tagh fault to the Qilian Shan in the northern Tibetan Plateau: from early Cenozoic initiation to mid-Miocene extrusion
The Altyn Tagh fault has been a crucial tectonic boundary of the Tibetan Plateau during the Cenozoic India-Eurasia collision. However, the Cenozoic evolution of the kinematic transformation from the eastern Altyn Tagh fault to the Qilian Shan remains unresolved. Here we focus on the kinematics at a crucial point along the Altyn Tagh fault, the Subei triple junction, which are recorded by faulting in the Suganhu basin to the south of the junction. We reconstructed the structural pattern of faults and the thickness distribution of the Cenozoic strata in the Suganhu basin by integrating seismic profiles, well logging, and topographic data. We inferred that during the early Cenozoic the strike-slip displacement along the Altyn Tagh fault was absorbed solely by crustal shortening and thickening in the Danghenan Shan, a prominent topographic high. Based on the fault geometry and the uneven thickness distribution across the fault belts, we inferred that strike-slip fault belts within the Suganhu basin were initiated in the mid-Miocene. We thus proposed that a mid-Miocene kinematic transformation was realized by blocks extruding southeastward, as well as by crustal shortening and thickening in the entire Qilian Shan. These blocks are bounded by preexisting weaknesses accommodating lateral movements, and lithospheric heterogeneity played an essential role in the block-scale extrusion.
Fast-U2++: Fast and Accurate End-to-End Speech Recognition in Joint CTC/Attention Frames
Recently, the unified streaming and non-streaming two-pass (U2/U2++)
end-to-end model for speech recognition has shown great performance in terms of
streaming capability, accuracy and latency. In this paper, we present
fast-U2++, an enhanced version of U2++ to further reduce partial latency. The
core idea of fast-U2++ is to output partial results of the bottom layers in its
encoder with a small chunk, while using a large chunk in the top layers of its
encoder to compensate for the performance degradation caused by the small chunk. Moreover, we use a knowledge distillation method to reduce the token emission latency. We present extensive experiments on the Aishell-1 dataset. Experiments and ablation studies show that, compared to U2++, fast-U2++ reduces model latency from 320ms to 80ms and achieves a character error rate (CER) of 5.06% with a streaming setup.
Comment: 5 pages, 3 figure
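A minimal Python sketch of the small-chunk / large-chunk split described above, with attention caching and other streaming state omitted for brevity. The layer split, chunk sizes, and the ctc_greedy helper are illustrative assumptions rather than the fast-U2++ implementation:

    # Minimal sketch: bottom encoder layers run on every small chunk to emit quick
    # partial hypotheses; top layers run once a large chunk has accumulated.
    import torch

    def dual_chunk_pass(bottom_layers, top_layers, ctc_greedy, small_chunks, large_chunk=16):
        """Consume small feature chunks; return (partial, final) hypothesis lists."""
        partials, finals, bottom_buf = [], [], []
        for chunk in small_chunks:
            h = chunk
            for layer in bottom_layers:          # small-chunk pass: low emission latency
                h = layer(h)
            bottom_buf.append(h)
            partials.append(ctc_greedy(h))       # quick hypothesis shown to the user
            if sum(x.shape[1] for x in bottom_buf) >= large_chunk:
                h_big = torch.cat(bottom_buf, dim=1)
                for layer in top_layers:         # large-chunk pass: recovers accuracy
                    h_big = layer(h_big)
                finals.append(ctc_greedy(h_big)) # supersedes the partials for this span
                bottom_buf = []
        return partials, finals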
TrimTail: Low-Latency Streaming ASR with Simple but Effective Spectrogram-Level Length Penalty
In this paper, we present TrimTail, a simple but effective emission
regularization method to improve the latency of streaming ASR models. The core
idea of TrimTail is to apply a length penalty (i.e., by trimming trailing frames, see Fig. 1-(b)) directly to the spectrogram of input utterances, which does not require any alignment. We demonstrate that TrimTail is computationally cheap and can be applied online and optimized with any training loss or any model architecture on any dataset without any extra effort, by applying it to various end-to-end streaming ASR networks trained with either CTC loss [1] or Transducer loss [2]. We achieve a 100-200ms latency reduction with equal or even better accuracy on both Aishell-1 and Librispeech. Moreover, by using TrimTail, we can achieve a 400ms algorithmic improvement in User Sensitive Delay (USD) with an accuracy loss of less than 0.2.
Comment: submitted to ICASSP 202
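A minimal Python sketch of a spectrogram-level length penalty in the spirit of the method above: a random number of trailing frames is dropped from each training utterance by shrinking its valid length, with no alignment required. The trim range (0-10 frames, roughly 0-100 ms at a 10 ms hop) and the batch layout are illustrative assumptions, not the paper's tuned setting:

    # Minimal sketch: randomly trim trailing frames from a padded feature batch.
    import torch

    def trim_tail(feats, feat_lens, max_trim_frames=10):
        """feats: (batch, max_frames, feat_dim) log-mels; feat_lens: (batch,) valid lengths."""
        trims = torch.randint(0, max_trim_frames + 1, (feats.shape[0],))
        new_lens = torch.clamp(feat_lens - trims, min=1)   # never trim an utterance away
        # The padding itself is untouched; only the valid length shrinks, so the
        # trailing frames are ignored by length-aware losses (e.g. CTC or Transducer).
        return feats, new_lens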